nlp_architect.solutions.trend_analysis package

Submodules

nlp_architect.solutions.trend_analysis.np_scorer module

class nlp_architect.solutions.trend_analysis.np_scorer.NPScorer(parser=None)[source]

Bases: object

score_documents(texts: list, limit=-1, return_all=False, min_tf=5)[source]

nlp_architect.solutions.trend_analysis.scoring_utils module

class nlp_architect.solutions.trend_analysis.scoring_utils.CorpusIndex(documents: list, spans: list)[source]

Bases: object

Text span index class. Holds TF and DF values per span. Text spans are normalized, and similar spans are mapped to the same TF/DF values.
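The grouping behavior can be illustrated with a minimal pure-Python sketch (the `build_index` helper is hypothetical, and normalization here is just lowercasing; the library's actual normalization is richer):

```python
from collections import defaultdict

def build_index(documents):
    """Map each normalized span to its corpus-wide TF and DF.

    `documents` is a list of lists of span strings, one inner
    list per document.
    """
    tf = defaultdict(int)   # total occurrences across the corpus
    df = defaultdict(int)   # number of documents containing the span
    for spans in documents:
        seen = set()
        for span in spans:
            key = span.lower()  # normalize so variants share one entry
            tf[key] += 1
            seen.add(key)
        for key in seen:
            df[key] += 1
    return tf, df

docs = [["Neural Networks", "deep learning"],
        ["neural networks", "neural networks"]]
tf, df = build_index(docs)
# "Neural Networks" and "neural networks" share one TF/DF entry:
# tf["neural networks"] == 3, df["neural networks"] == 2
```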

df(phrase)[source]

Get the DF (document frequency) of a phrase in the corpus

static get_docid(d: spacy.tokens.doc.Doc)[source]

Get the document ID

get_phrase(pid)[source]
static get_pid(p: spacy.tokens.span.Span)[source]

Get the phrase ID

get_subphrases_of_word(w)[source]
static get_wid(w: spacy.tokens.token.Token)[source]

Get the word ID

static hash_func(x)[source]
tf(phrase)[source]

Get the TF (term frequency) of a phrase in a document

class nlp_architect.solutions.trend_analysis.scoring_utils.TextSpanScoring(documents, spans, min_tf=1)[source]

Bases: object

Text spans scoring class. Contains miscellaneous scoring algorithms for scoring text fragments extracted from a corpus.

Parameters:
  • documents (list) – List of spaCy documents.
  • spans (list[list]) – List of spaCy spans representing the noun phrases of each document.
doc_text_spans
documents
get_cvalue_scores(group_similar_spans=True)[source]
get_freq_scores(group_similar_spans=True)[source]
get_tfidf_scores(group_similar_spans=True)[source]

Get TF-IDF scores of spans. Returns a list of spans sorted in descending order of importance, where span score = TF (global) * (1 + log_n(DF/N))
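Read literally, and assuming `log_n` denotes a logarithm in base N (the number of documents), the formula can be sketched as below. Under this reading a span occurring in every document keeps its full TF, while a span occurring in a single document scores 0:

```python
import math

def tfidf_score(tf_global, df, n_docs):
    # span score = TF (global) * (1 + log_n(DF/N)), reading log_n
    # as a logarithm in base N = n_docs (an assumption).
    return tf_global * (1 + math.log(df / n_docs, n_docs))

tfidf_score(7, 10, 10)  # 7 * (1 + 0) = 7.0: seen in all 10 docs
tfidf_score(7, 1, 10)   # ≈ 0.0: seen in a single document
```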

group_spans(phrases)[source]
static interpolate_scores(phrase_lists, weights=None)[source]
static multiply_scores(phrase_lists)[source]
static normalize_l2(phrases_list)[source]
static normalize_minmax(phrases_list, invert=False)[source]
re_index_min_tf(tf)[source]
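As an illustration of what the normalization and interpolation helpers plausibly do (a pure-Python sketch, not the library's implementation):

```python
def normalize_minmax(scores, invert=False):
    # Scale scores into [0, 1]; invert flips the ranking, which is
    # useful when a lower raw score should mean higher importance.
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # constant lists map to all zeros
    normed = [(s - lo) / span for s in scores]
    return [1.0 - s for s in normed] if invert else normed

def interpolate_scores(score_lists, weights=None):
    # Element-wise weighted sum of parallel score lists;
    # equal weights by default.
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    return [sum(w * s for w, s in zip(weights, col))
            for col in zip(*score_lists)]

tfidf = normalize_minmax([2.0, 4.0, 6.0])      # [0.0, 0.5, 1.0]
cval = normalize_minmax([10.0, 0.0, 5.0])      # [1.0, 0.0, 0.5]
interpolate_scores([tfidf, cval], [0.5, 0.5])  # [0.5, 0.25, 0.75]
```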

nlp_architect.solutions.trend_analysis.topic_extraction module

nlp_architect.solutions.trend_analysis.topic_extraction.create_w2v_model(text_list_t, text_list_r)[source]

Create a w2v model on the given corpora

Parameters:
  • text_list_t – A list of documents - target corpus (List[String])
  • text_list_r – A list of documents - reference corpus (List[String])
nlp_architect.solutions.trend_analysis.topic_extraction.get_urls_from_file(file)[source]

Load a list of URLs from a file

Parameters:file – A path to a file containing URLs (String)
Returns:A list of URLs (List[String])

nlp_architect.solutions.trend_analysis.topic_extraction.initiate_parser()[source]
nlp_architect.solutions.trend_analysis.topic_extraction.load_text_from_folder(folder)[source]

Load files content into a list of docs (texts)

Parameters:folder – A path to a folder containing text files
Returns:A list of documents (List[String])
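A minimal sketch of the documented behavior (assuming one document per file and UTF-8 content; the library's version may differ):

```python
import os

def load_text_from_folder(folder):
    # Read every regular file in `folder` (in name order) into a
    # list of document strings; assumes UTF-8 text files.
    docs = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                docs.append(f.read())
    return docs
```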
nlp_architect.solutions.trend_analysis.topic_extraction.load_url_content(url_list)[source]

Load articles content into a list of docs (texts)

Parameters:url_list (List[String]) – A list of urls
Returns:A list of documents (List[String])
nlp_architect.solutions.trend_analysis.topic_extraction.main(corpus_t, corpus_r, single_thread, no_train, url)[source]
nlp_architect.solutions.trend_analysis.topic_extraction.noun_phrase_extraction(docs, parser)[source]

Extract noun-phrases from a textual corpus

Parameters:
  • docs (List[String]) – A list of documents
  • parser (SpacyInstance) – Spacy NLP parser
Returns:

List of topics with their tf_idf, c_value, language-model scores

nlp_architect.solutions.trend_analysis.topic_extraction.save_scores(np_result, file_path)[source]

Save the result of a topic extraction into a file

Parameters:
  • np_result – A list of topics with different score types (tfidf, cvalue, freq)
  • file_path – The output file path
nlp_architect.solutions.trend_analysis.topic_extraction.train_w2v_model(data)[source]

Train a w2v (skipgram) model using the fasttext package

Parameters:data – A path to the training data (String)
nlp_architect.solutions.trend_analysis.topic_extraction.unify_corpora_from_folders(corpus_a, corpus_b)[source]

Merge two corpora into a single text file

Parameters:
  • corpus_a – A folder containing text files (String)
  • corpus_b – A folder containing text files (String)
Returns:

The path of the unified corpus

nlp_architect.solutions.trend_analysis.topic_extraction.unify_corpora_from_texts(text_list_t, text_list_r)[source]

Merge two corpora into a single text file

Parameters:
  • text_list_t – A list of documents - target corpus (List[String])
  • text_list_r – A list of documents - reference corpus (List[String])
Returns:

The path of the unified corpus

nlp_architect.solutions.trend_analysis.topic_extraction.write_folder_corpus_to_file(corpus, writer)[source]

Merge content of a folder into a single text file

Parameters:
  • corpus – A folder containing text files (String)
  • writer – A file writer
nlp_architect.solutions.trend_analysis.topic_extraction.write_text_list_to_file(text_list, writer)[source]

Merge a list of texts into a single text file

Parameters:
  • text_list – A list of texts (List[String])
  • writer – A file writer
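The two writers above can be sketched together in pure Python (here the output path is an explicit parameter so the example is self-contained, whereas the real function derives its own output path):

```python
def write_text_list_to_file(text_list, writer):
    # Write each text on its own line to the open file `writer`.
    for text in text_list:
        writer.write(text)
        writer.write("\n")

def unify_corpora_from_texts(text_list_t, text_list_r, out_path):
    # Merge target and reference texts into one file and return
    # its path. NOTE: `out_path` is an added parameter for this
    # sketch only.
    with open(out_path, "w", encoding="utf-8") as writer:
        write_text_list_to_file(text_list_t, writer)
        write_text_list_to_file(text_list_r, writer)
    return out_path
```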

nlp_architect.solutions.trend_analysis.trend_analysis module

nlp_architect.solutions.trend_analysis.trend_analysis.analyze(target_data, ref_data, tar_header, ref_header, top_n=10000, top_n_vectors=500, re_analysis=False, tfidf_w=0.5, cval_w=0.5, lm_w=0)[source]

Compare the topic list of a target corpus to the topic list of a reference corpus, and extract hot topics, trends and clusters. Topic lists can be generated by running topic_extraction.py

Parameters:
  • target_data – A list of topics with importance scores extracted from the target corpus
  • ref_data – A list of topics with importance scores extracted from the reference corpus
  • tar_header – The header to appear for the target topics graphs
  • ref_header – The header to appear for the reference topics graphs
  • top_n (int) – Limit the analysis to only the top N phrases of each list
  • top_n_vectors (int) – The number of vectors to include in the scatter
  • re_analysis (Boolean) – Whether a previous analysis has already been performed
  • tfidf_w (Float) – the TF_IDF weight for the final score calculation
  • cval_w (Float) – the C_Value weight for the final score calculation
  • lm_w (Float) – the Language-Model weight for the final score calculation
nlp_architect.solutions.trend_analysis.trend_analysis.calc_scores(scores_file, tfidf_w, cval_w, lm_w, output_path)[source]

Given a topic list with tf_idf, c_value and language_model scores, compute a final score for each phrase group according to the given weights

Parameters:
  • scores_file (String) – A path to the file with groups and raw scores
  • tfidf_w (Float) – the TF_IDF weight for the final score calculation
  • cval_w (Float) – the C_Value weight for the final score calculation
  • lm_w (Float) – the Language-Model weight for the final score calculation
  • output_path – A path for the output file of final scores (String)
nlp_architect.solutions.trend_analysis.trend_analysis.clean_group(phrase_group)[source]

Returns the shortest element in a group of phrases

Parameters:phrase_group (String) – a group of phrases separated by ';'
Returns:The shortest phrase in the group (String)
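A one-line sketch of the documented behavior:

```python
def clean_group(phrase_group):
    # The shortest phrase in a ';'-separated group is taken as the
    # group's representative.
    return min(phrase_group.split(";"), key=len)

clean_group("deep neural networks;deep nets;DNNs")  # 'DNNs'
```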
nlp_architect.solutions.trend_analysis.trend_analysis.compute_scatter_subwords(top_groups, w2v_loc)[source]

Compute 2D vectors of the provided phrases groups

Parameters:
  • top_groups – A list of group-representative phrases (List(String))
  • w2v_loc – A path to a w2v model (String)
Returns:

A tuple (phrases, x, y, n) WHERE:
  • phrases – A list of phrases that are part of the model
  • x – A DataFrame column with the x values of the phrase vectors
  • y – A DataFrame column with the y values of the phrase vectors
  • n – The number of computed vectors

nlp_architect.solutions.trend_analysis.trend_analysis.merge_phrases(data, is_ref_data, hash2group, rep2rank, top_n, topics_count)[source]

Analyze the provided topics data and detect trends (changes in importance)

Parameters:
  • data – A list of topics with importance scores
  • is_ref_data (Boolean) – Whether the data was extracted from the reference corpus (rather than the target corpus)
  • hash2group – A dictionary storing the data of each topic
  • rep2rank – A dict of all groups representatives and their ranks
  • top_n (int) – Limit the analysis to only the top N phrases of each list
  • topics_count (int) – The total sum of all topics extracted from both corpora
nlp_architect.solutions.trend_analysis.trend_analysis.save_report_data(hash2group, groups_r_sorted, groups_t_sorted, trends_sorted, all_topics_sorted, create_clusters, target_header, ref_header, tfidf_w, cval_w, freq_w, in_model_count, top_n_scatter)[source]
nlp_architect.solutions.trend_analysis.trend_analysis.save_scores_list(scores, file_path)[source]

Save the list of topics-scores into a file

Parameters:
  • scores – A list of topics (groups) with scores
  • file_path – The output file path

Module contents